Data Ingestion from SFTP to Snowflake
You can create a data ingestion pipeline that uses an SFTP data source and pushes the data into a Snowflake data lake. The file formats supported for SFTP are CSV, JSON, XLSX, and Parquet. Databricks is used as the data integration tool.
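Conceptually, the generated Databricks job pulls the selected files from the SFTP location into a staging area and then loads them into Snowflake. The following Python sketch illustrates only the SFTP retrieval step; the host, credentials, and paths are placeholders, not values used by the Lazsa Platform.

# Illustrative sketch only -- the Lazsa-generated Databricks job performs this step for you.
import paramiko

# Placeholder connection details for the SFTP source.
transport = paramiko.Transport(("sftp.example.com", 22))
transport.connect(username="<user>", password="<password>")
sftp = paramiko.SFTPClient.from_transport(transport)

# Download a source file (CSV, JSON, XLSX, or Parquet) to a staging location.
sftp.get("/exports/sales/orders.csv", "/tmp/staging/orders.csv")

sftp.close()
transport.close()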
- Create a data pipeline with the following nodes: SFTP > Databricks > Snowflake.
- Configure the SFTP and Snowflake nodes.
- Click the Databricks node and click Create Job.
- Complete the following steps to create the Databricks data integration job:

- Template - this is automatically selected based on the stages selected.
- Job Name - Provide a name for the data integration job.
- Node Rerun Attempts - Specify the number of times the job is rerun in case of failure. The default value is inherited from the pipeline-level setting. (See the sketch after these steps.)
Click Next.
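The rerun setting behaves conceptually like a retry loop around the job run. The following minimal Python sketch is illustrative only and is not part of the Lazsa or Databricks API.

# Illustrative only -- shows the semantics of rerun attempts, not the actual implementation.
def run_with_reruns(run_job, rerun_attempts=2):
    """Run a job and rerun it up to rerun_attempts times if it fails."""
    for attempt in range(rerun_attempts + 1):
        try:
            return run_job()
        except Exception as error:
            if attempt == rerun_attempts:
                raise  # all rerun attempts exhausted
            print(f"Attempt {attempt + 1} failed ({error}); rerunning...")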

The source is preselected based on the SFTP instance that you have configured. The supported file formats are CSV, JSON, XLSX, and Parquet.
Decide whether you want to partition the data when loading it into the target data lake. Partitioning the data helps you manage and query it more easily (see the sketch after this list). Provide the following details:
- Source - the source is preselected based on the configuration that is selected.
- Datastore - the datastore is preselected based on the configuration that is selected.
- Select Path - select the folder from the SFTP folder structure.
- File Format - one of the following file formats is selected based on the data source node that is configured:
  - CSV
  - JSON
  - XLSX
  - PARQUET
- Select File - browse to select the required file and click Add File.
- Path - the path is displayed based on the folder structure that you selected.
Click Next.
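If you choose to partition the data, it is written organized by the selected partition columns, which makes filtering and querying on those columns cheaper. The following PySpark sketch is a generic illustration; the staging paths and the region and order_date columns are assumptions, and the exact partitioning mechanism applied to your target is determined by the generated job.

# Illustrative sketch of a partitioned write; not the Lazsa-generated code.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical staging path for the files pulled from the SFTP source.
df = spark.read.parquet("/tmp/staging/orders/")

# Partitioned write: the output is organized by the chosen partition columns.
(df.write
    .mode("overwrite")
    .partitionBy("region", "order_date")
    .parquet("/tmp/lake/orders/"))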

- Target - the target is preselected based on the selected configuration. You can select one of the following options:
  - Configured Datastore
  - New Datastore
- Datastore - the datastore is preselected based on the selected configuration.
- Warehouse - the warehouse is preselected.
- Database - the database is preselected.
- Schema - the schema is preselected (see the sketch after this list).
Click Next.
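The preselected warehouse, database, and schema map directly to the connection options of the Snowflake write. The following PySpark sketch is illustrative; all option values are placeholders supplied in practice by the configured Snowflake datastore, and it assumes the Snowflake Spark connector is available on the cluster.

# Illustrative sketch of writing ingested data to the configured Snowflake target.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.option("header", "true").csv("/tmp/staging/orders.csv")

sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfWarehouse": "<warehouse>",   # preselected warehouse
    "sfDatabase": "<database>",     # preselected database
    "sfSchema": "<schema>",         # preselected schema
}

(df.write
    .format("snowflake")            # short name on Databricks; otherwise "net.snowflake.spark.snowflake"
    .options(**sf_options)
    .option("dbtable", "ORDERS")    # placeholder target table
    .mode("append")
    .save())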

- Map source data to target tables - map the source data to the tables in the target. Note the following points while mapping the source data to target tables:
  - If the source is an Excel file, it may consist of multiple sheets. The target table name that you provide is treated as an audit table. Lazsa generates the audit table with the required schema and also generates an individual table for each sheet in the Excel file. You can provide names for the sheet tables according to your requirement (see the sketch after these points). For example, if your Excel file contains three sheets and the table name is "Sales," then the following tables are created:
    - Sales (or whatever name you provide to this table; this table contains the audit logs and other information such as the source path, the sheets in the file, and the records processed)
    - <name of Sheet1>
    - <name of Sheet2>
    - <name of Sheet3>
  - If the path selected for the source is at the folder level, and the target table does not have a column named LAZSA_AUDITING_COL, then a column with that name is created after the job run. The required data type of LAZSA_AUDITING_COL depends on the source node, as shown in the following sample code.
    Sample code for adding LAZSA_AUDITING_COL:
    Amazon S3:
    ALTER TABLE <TABLE_NAME> ADD COLUMN LAZSA_AUDITING_COL NUMBER(38,0);
    SFTP:
    ALTER TABLE <TABLE_NAME> ADD COLUMN LAZSA_AUDITING_COL TIMESTAMP_NTZ(9);
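For reference, the multi-sheet Excel behavior described above can be approximated with pandas. The file path, sheet names, and audit fields below are placeholders; the actual audit schema is generated by Lazsa.

# Illustrative sketch of the per-sheet tables plus audit table for an Excel source.
import pandas as pd

# Read every sheet of the Excel file; returns a dict of sheet name -> DataFrame.
sheets = pd.read_excel("/tmp/staging/sales.xlsx", sheet_name=None)

audit_rows = []
for sheet_name, sheet_df in sheets.items():
    # In the real job, each sheet becomes its own target table.
    audit_rows.append({
        "source_path": "/tmp/staging/sales.xlsx",
        "sheet": sheet_name,
        "records_processed": len(sheet_df),
    })

# The "Sales" table (or whatever name you provide) holds audit information like this.
audit_df = pd.DataFrame(audit_rows)
print(audit_df)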
Click Next.

You can select an all-purpose cluster or a job cluster to run the configured job. If your Databricks cluster is not created through the Lazsa Platform and you want to update custom environment variables, refer to the following options (and the sketch after them):
- Select an All Purpose Cluster - this cluster is already configured; select one from the dropdown.
  Note: If you do not see a cluster configuration in the dropdown list, it is possible that the configured Databricks cluster has been deleted. In this case, you must create a new Databricks cluster configuration in the Data Integration section of Cloud Platform Tools and Technologies. Delete the data integration node from the data pipeline, add a new node with the newly created configuration, and configure the job again. Now you can select the newly configured Databricks cluster.
- Job Cluster - provide the required details to create a job cluster.
Click Complete.
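If you manage the job cluster yourself, custom environment variables can be set in the cluster specification. The following sketch uses the Databricks Jobs API (2.1); the workspace URL, token, notebook path, and all values are placeholders, and the Lazsa Platform normally creates this job for you.

# Illustrative sketch of creating a job with a job cluster and custom environment variables.
import requests

job_spec = {
    "name": "sftp-to-snowflake-ingest",      # placeholder job name
    "tasks": [{
        "task_key": "ingest",
        "notebook_task": {"notebook_path": "/Jobs/sftp_to_snowflake"},   # placeholder notebook
        "new_cluster": {
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 2,
            # Custom environment variables for clusters not created through Lazsa.
            "spark_env_vars": {"MY_CUSTOM_VAR": "value"},
        },
    }],
}

response = requests.post(
    "https://<databricks-instance>/api/2.1/jobs/create",
    headers={"Authorization": "Bearer <personal-access-token>"},
    json=job_spec,
)
print(response.json())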

SQS and SNS
- Configurations - Select an SQS or SNS configuration that is integrated with the Lazsa Platform.
- Events - Select the events for which you want to enable SQS or SNS queues (see the sketch after these options):
  - Select All
  - Node Execution Failed
  - Node Execution Succeeded
  - Node Execution Running
  - Node Execution Rejected
- Event Details - Select the details of the events for which notifications are enabled.
- Additional Parameters - provide any additional parameters to be considered for SQS and SNS queues.
Click Complete.
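Once events are enabled, notifications for the selected node-execution events are delivered to the configured SQS queue or SNS topic. The following boto3 sketch shows one way to consume such messages from an SQS queue; the queue URL, region, and message format are assumptions, so check the actual payload delivered to your queue.

# Illustrative sketch of polling an SQS queue for node-execution event notifications.
import json
import boto3

sqs = boto3.client("sqs", region_name="us-east-1")   # placeholder region
queue_url = "https://sqs.us-east-1.amazonaws.com/<account-id>/<queue-name>"

response = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=10)

for message in response.get("Messages", []):
    event = json.loads(message["Body"])   # payload format depends on your configuration
    # React to events such as "Node Execution Failed" or "Node Execution Succeeded".
    print(event)
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])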
What's next? Snowflake Bulk Ingest with Storage Integration